Input Parsing

I have made the following assumptions regarding how the words in an Indian language (most of them, though not all) are assembled. (In the following discussion, "alphabet" refers to the list of basic characters in the language script, and "letter" refers to an element of the complete list of characters: the basic characters plus all the composite forms.)

The alphabet is divided into two groups: vowels and consonants.

Each Indian language letter may be a vowel, a consonant-vowel pair, or a ligature-vowel pair.

Ligatures are sequences of consonants.

If a consonant xxx is followed by another consonant yyy, xxx is assumed to be a half-consonant, i.e., the half-form of xxx must be displayed at that point. If a ligature for the pair xxx-yyy exists, however, the ligature is used instead. Note that not all Indian languages make use of ligatures.
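The half-consonant rule can be illustrated with a minimal sketch. This is not the package's actual code: the function name render_cluster, the LIGATURES table, the Roman consonant codes, and the "half-" glyph naming are all hypothetical, chosen only to show the order in which the two cases are tried.

```python
# Hypothetical ligature table: maps a consonant pair to a ligature glyph name.
# A script without ligatures would simply use an empty table.
LIGATURES = {("k", "sh"): "ksh", ("j", "n"): "jn"}


def render_cluster(consonants, ligatures=LIGATURES):
    """Return the glyph names for a run of consecutive consonants."""
    glyphs = []
    i = 0
    while i < len(consonants):
        pair = tuple(consonants[i:i + 2])
        if len(pair) == 2 and pair in ligatures:
            # A ligature exists for xxx-yyy: it replaces both consonants.
            glyphs.append(ligatures[pair])
            i += 2
        elif i + 1 < len(consonants):
            # Another consonant follows: show the half-form of this one.
            glyphs.append("half-" + consonants[i])
            i += 1
        else:
            # Last consonant of the run: show the full form.
            glyphs.append(consonants[i])
            i += 1
    return glyphs
```

For example, render_cluster(["s", "t"]) yields ["half-s", "t"], while render_cluster(["k", "sh"]) yields ["ksh"]. The sketch only looks at pairs and does not handle a ligature that is itself followed by another consonant, which a real implementation would need to consider.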

Apart from the vowels and consonants, some special forms are also provided, such as the chandra-bindu, the anuswara, and the virama. These special forms always follow the letter they affect, i.e., you specify the letter first, then the special form. A sequence of letters delimited by white space or punctuation forms a word.

Based on these assumptions, a simple parser has been built to recognize the basic unit, the letter.
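To make the assumptions concrete, here is a rough sketch of how such a letter parser might look. It is an illustration only, not the parser described here: the symbol sets, the Roman input codes (including "N", "M", and "H" for the special forms), and the function names tokenize, parse_letters, and parse_text are all assumptions made for the example.

```python
import re

# Hypothetical symbol sets, for illustration only.
VOWELS = {"a", "aa", "i", "ii", "u", "e", "o"}
CONSONANTS = {"k", "kh", "g", "t", "n", "m", "r", "s", "sh"}
SPECIALS = {"N": "chandra-bindu", "M": "anuswara", "H": "virama"}

# Try longer symbols first so that "sh" is not read as "s" plus an error.
SYMBOLS = sorted(VOWELS | CONSONANTS | set(SPECIALS), key=len, reverse=True)


def tokenize(word):
    """Split a word into alphabet symbols using greedy longest match."""
    tokens = []
    pos = 0
    while pos < len(word):
        for sym in SYMBOLS:
            if word.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"unrecognized input at {word[pos:]!r}")
    return tokens


def parse_letters(word):
    """Group symbols into letters: (consonants, vowel, specials)."""
    tokens = tokenize(word)
    letters = []
    i = 0
    while i < len(tokens):
        # Zero or more consonants: none for a bare vowel, one for a
        # consonant-vowel pair, several for a ligature-vowel pair.
        consonants = []
        while i < len(tokens) and tokens[i] in CONSONANTS:
            consonants.append(tokens[i])
            i += 1
        # The vowel that closes the letter (None if the word ends first).
        vowel = None
        if i < len(tokens) and tokens[i] in VOWELS:
            vowel = tokens[i]
            i += 1
        if not consonants and vowel is None:
            raise ValueError(f"special form {tokens[i]!r} has no letter")
        # Special forms are suffixes on the letter just read.
        specials = []
        while i < len(tokens) and tokens[i] in SPECIALS:
            specials.append(tokens[i])
            i += 1
        letters.append((consonants, vowel, specials))
    return letters


def parse_text(text):
    """Split text into words on white space or punctuation, then parse each."""
    words = [w for w in re.split(r"[\s.,;:!?]+", text) if w]
    return [parse_letters(w) for w in words]
```

With these assumed codes, parse_letters("kshaa") returns a single letter, (["k", "sh"], "aa", []), and parse_letters("aMka") returns two letters, the first carrying the anuswara as a suffix. How a consonant run with no following vowel should be treated is not covered by the assumptions above; the sketch simply records the vowel as None.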